
[AutoSparkUT] Fix LIKE with invalid escape pattern to match CPU behavior#14388

Open
wjxiz1992 wants to merge 3 commits into NVIDIA:main from wjxiz1992:fix/14117-like-escape-validation

Conversation


@wjxiz1992 wjxiz1992 commented Mar 10, 2026

Closes #14117

Summary

  • Validate LIKE escape patterns on GPU to match CPU semantics. Previously GpuLike passed patterns directly to cuDF, which silently accepted invalid escape sequences (e.g. LIKE 'm%@ca' ESCAPE '%'), returning wrong results instead of throwing AnalysisException like CPU does.
  • Add tagExprForGpu in BinaryExprMeta[Like] with a focused O(n) escape-char validation that detects invalid patterns and falls back to CPU, which throws the correct AnalysisException.
  • Remove SPARK-33677 exclusion from RapidsTestSettings.
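
The escape validation described above can be sketched as follows. This is a hypothetical standalone version for illustration; the actual check lives in tagExprForGpu and calls willNotWorkOnGpu rather than returning a Boolean.

```scala
// Hypothetical standalone sketch of the O(n) escape-pattern validation.
// A pattern is invalid if the escape char appears at the end of the
// pattern, or is followed by anything other than '_', '%', or itself.
object LikeEscapeCheck {
  def isValidPattern(pattern: String, escapeChar: Char): Boolean = {
    var i = 0
    while (i < pattern.length) {
      if (pattern.charAt(i) == escapeChar) {
        // Trailing escape char: nothing left to escape.
        if (i + 1 >= pattern.length) return false
        val next = pattern.charAt(i + 1)
        // An escape char may only escape '_', '%', or itself.
        if (next != '_' && next != '%' && next != escapeChar) return false
        i += 2 // skip the escaped character
      } else {
        i += 1
      }
    }
    true
  }
}
```

For example, `LIKE 'm%@ca' ESCAPE '%'` is rejected because '%' escapes '@', while `LIKE '50@%' ESCAPE '@'` is accepted because '@' escapes '%'.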

UT Traceability

Item Detail
RAPIDS suite RapidsSQLQuerySuite
Spark original test "SPARK-33677: LikeSimplification should be skipped if pattern contains any escapeChar"
Spark source file sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Line range L3758–L3773
Source (master) SQLQuerySuite.scala#L3758-L3773
Source (pinned) SQLQuerySuite.scala@f66c336#L3758-L3773

Performance Impact

Negligible. The validation runs once during query planning (not per-row), only on the literal pattern string. It is a simple O(n) character scan with no allocations — no regex construction, no Pattern.quote calls. Typical LIKE patterns are a few dozen characters. There is no change to the per-batch GPU execution path (GpuLike.doColumnar).

Test Plan

  • mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=...RapidsSQLQuerySuite — 215 passed, 0 failed, 18 ignored
  • Pre-merge CI (defaults unchanged, expected no regression)

Checklists

  • This PR has added documentation for new or modified features or
    behaviors.
  • This PR has added new tests or modified existing tests to cover
    new code paths.
    (The SPARK-33677 exclusion is removed, re-enabling the existing Spark test
    "SPARK-33677: LikeSimplification should be skipped if pattern contains any escapeChar"
    in RapidsSQLQuerySuite, which covers the new tagExprForGpu validation path.)
  • Performance testing has been performed and its results are added
    in the PR description. Or, an issue has been filed with a link in the PR
    description.

Made with Cursor

Copilot AI review requested due to automatic review settings March 10, 2026 04:15
@wjxiz1992 (Collaborator Author)

build

Copilot AI (Contributor) left a comment

Pull request overview

This PR aligns GPU LIKE ... ESCAPE ... behavior with Spark CPU semantics by validating escape patterns and avoiding GPU execution for invalid patterns that should raise an AnalysisException.

Changes:

  • Add GPU planning-time validation for LIKE escape patterns and fall back to CPU on invalid patterns.
  • Add executor-side defensive validation in GpuLike.doColumnar as a safety net.
  • Re-enable the previously excluded Spark 3.3.0 unit test for SPARK-33677 in RapidsTestSettings.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala Removes the SPARK-33677 exclusion so the test runs again.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala Adds one-time runtime validation of the LIKE pattern escape semantics inside GpuLike.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala Adds tagExprForGpu validation for Like to detect invalid escape patterns and force CPU fallback.



greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR fixes GpuLike to match CPU semantics when a LIKE pattern contains an invalid escape sequence (e.g. LIKE 'm%@ca' ESCAPE '%'). Previously cuDF silently accepted such patterns and returned wrong results; now the GPU planner detects them at planning time and falls back to CPU, which throws the correct AnalysisException. The SPARK-33677 test exclusion (issue #14117) is removed as a result.

Key changes:

  • Adds tagExprForGpu to BinaryExprMeta[Like] in GpuOverrides.scala that iterates the literal pattern, detects invalid escape sequences (escape char not followed by _, %, or the escape char itself, or trailing escape char), and calls willNotWorkOnGpu to trigger CPU fallback.
  • Removes the SPARK-33677: LikeSimplification should be skipped if pattern contains any escapeChar exclusion from RapidsTestSettings.scala since the bug is now fixed.
  • The PR description mentions adding defensive validation inside GpuLike.doColumnar as a safety net, but this was not included in the diff — stringFunctions.scala is unchanged. For the known code paths this does not expose a regression, but the PR description is inaccurate on this point.

Confidence Score: 4/5

  • Safe to merge — the validation logic is correct and the fallback to CPU is the right behaviour for invalid escape patterns.
  • The escape-validation logic in tagExprForGpu is correct: it walks the literal pattern, detects trailing escape chars and chars not in {_, %, escape} after an escape, and falls back to CPU. The parent's tagExprForGpu is empty so not calling super is fine and consistent with the rest of the codebase. The one deduction is that the PR description claims a doColumnar safety net was added but it was not; while this is not a runtime regression today (all invalid patterns are caught at plan time or rejected by ExprChecks), the description is misleading.
  • No files require special attention beyond the noted discrepancy between the PR description and the absence of defensive validation in GpuLike.doColumnar.

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala Adds tagExprForGpu override for Like to validate escape sequences at planning time and fall back to CPU for invalid patterns; logic is correct but PR description mentions a doColumnar safety net that was not implemented.
tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala Removes SPARK-33677 exclusion (tracked as issue #14117) which is now fixed by the escape-pattern validation fallback.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LIKE expression reaches GPU planner] --> B{a.right is Literal UTF8String?}
    B -- No --> E[case _ : no validation, ExprChecks already rejects non-literals]
    B -- Yes --> C[Iterate pattern chars with tagExprForGpu]
    C --> D{Current char == escapeChar?}
    D -- No --> F[Advance i, continue]
    F --> D
    D -- Yes --> G{Next char exists?}
    G -- No trailing escape --> H[willNotWorkOnGpu → CPU fallback → AnalysisException]
    G -- Yes --> I{Next char in '_', '%', escapeChar?}
    I -- No invalid escape --> H
    I -- Yes valid escape --> J[Advance i by 2, continue]
    J --> D
    C --> K[All chars valid → convertToGpu → GpuLike.doColumnar → cuDF like]

Comments Outside Diff (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala, line 3529-3530 (link)

    PR description mentions unimplemented safety net

    The PR summary explicitly states: "Add defensive validation in GpuLike.doColumnar as a safety net." However, stringFunctions.scala was not changed — GpuLike.doColumnar still passes the pattern directly to cuDF without any validation. The tagExprForGpu fallback handles all planning-time cases (literal patterns are caught; non-literal patterns are rejected by ExprChecks), so there is no known runtime path that would reach doColumnar with an invalid escape sequence today. If the intent was to add belt-and-suspenders protection (e.g., against future code paths that construct GpuLike without going through the planner), that safety net should be added to doColumnar as described, or the PR description should be updated to reflect that it was intentionally omitted.

Last reviewed commit: bd7f8eb

@wjxiz1992 (Collaborator Author)

build

…ior (NVIDIA#14117)

Validate LIKE escape patterns on GPU to match CPU semantics. Previously,
GpuLike passed patterns directly to cuDF which silently accepted invalid
escape sequences (e.g. LIKE 'm%@ca' ESCAPE '%'), returning wrong results
instead of throwing AnalysisException like CPU does.

- Add tagExprForGpu in BinaryExprMeta[Like] to detect invalid escape
  patterns and fall back to CPU, which throws the correct exception.
- Add defensive validation in GpuLike.doColumnar as a safety net.
- Remove SPARK-33677 exclusion from RapidsTestSettings.

Closes NVIDIA#14117

Maven validation:
  mvn test -pl tests -Dbuildver=330 \
    -DwildcardSuites=...RapidsSQLQuerySuite
  Tests: succeeded 215, failed 0, ignored 18

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
- Use NonFatal(e) instead of catching Exception in tagExprForGpu to
  avoid swallowing fatal errors; include e.getMessage for diagnostics.
- Check l.value.isInstanceOf[UTF8String] instead of l.dataType ==
  StringType to correctly handle CharType/VarcharType whose values
  are UTF8String at runtime.
- Use rhs.getValue.toString in GpuLike.doColumnar to safely handle
  both UTF8String and String backed scalars.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
@wjxiz1992 wjxiz1992 force-pushed the fix/14117-like-escape-validation branch from d4191a3 to 0e1b4ad Compare March 10, 2026 07:30
@wjxiz1992 (Collaborator Author)

build

@wjxiz1992 wjxiz1992 self-assigned this Mar 11, 2026
@revans2 revans2 left a comment

What is the performance impact of this change? All PRs are supposed to have a performance impact unless it is something like a doc or comment only change.

    if l.value.isInstanceOf[UTF8String] =>
      StringUtils.escapeLikeRegex(
        l.value.toString, a.escapeChar)
    case _ =>
Collaborator

So it is okay if we get a literal that is not a string? or is this more that constant folding was disabled?

Collaborator

We can remove the `case _ =>` path, because "search" only accepts a literal; it cannot be any other type.

("search", TypeSig.lit(TypeEnum.STRING), TypeSig.STRING))

@wjxiz1992 wjxiz1992 (Collaborator Author) Mar 12, 2026

Nice catch, updated to use a pattern match for it.

a.right match {
  case l: Literal
      if l.value.isInstanceOf[UTF8String] =>
    StringUtils.escapeLikeRegex(
Collaborator

This does a lot of work.

https://github.com/apache/spark/blob/104e43b75f4f106b03c63f0f0e5fd18d69b1cdf9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala#L36-L70

Can we please look at what it would take to validate this in a faster way? The main reason I ask is that we want to be able to support cuDF's like with a columnar pattern, and if we don't understand the validation, doing it on the GPU against the full column is going to be a pain.

@wjxiz1992 wjxiz1992 (Collaborator Author) Mar 12, 2026

Agree. I've now replaced it with lightweight logic that checks the two invalid cases defined in Spark's Like expression doc ("It is invalid to escape any other character") and implemented by StringUtils.escapeLikeRegex:

  1. Escape char at end of pattern (no char to escape)
  2. Escape char followed by a char that is not _, %, or the escape char itself

I'm not sure whether this should also be extracted into a dedicated function?
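
The two invalid cases can be illustrated with a small standalone helper. This is hypothetical code for illustration only; the real check reports the failure via willNotWorkOnGpu instead of returning a message.

```scala
// Hypothetical helper that reports the first invalid escape in a LIKE
// pattern, mirroring the rule "it is invalid to escape any other
// character". Returns None when the pattern is valid.
object EscapeExamples {
  def firstInvalidEscape(pattern: String, esc: Char): Option[String] = {
    var i = 0
    while (i < pattern.length) {
      if (pattern.charAt(i) == esc) {
        if (i + 1 >= pattern.length) {
          // Case 1: escape char at the end of the pattern.
          return Some(s"escape char '$esc' at end of pattern")
        }
        val n = pattern.charAt(i + 1)
        if (n != '_' && n != '%' && n != esc) {
          // Case 2: escape char followed by a non-escapable character.
          return Some(s"escape char '$esc' followed by '$n'")
        }
        i += 2
      } else {
        i += 1
      }
    }
    None
  }
}
```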


override def doColumnar(lhs: GpuColumnVector, rhs: GpuScalar): ColumnVector = {
  if (!escapeValidated && rhs.isValid) {
    StringUtils.escapeLikeRegex(
Collaborator

Why do we have to validate this here too? Should we cache if we did this in planning so we don't have to do it multiple times?

Collaborator Author

Right, there is no need to check at runtime; removed.

@res-life (Collaborator)

Add defensive validation in GpuLike.doColumnar as a safety net.

Not needed, because no GpuLike will be created if validation fails.

@res-life (Collaborator)

Add the checklist in the desc of this PR:
### Checklists

- [ ] This PR has added documentation for new or modified features or
behaviors.
- [ ] This PR has added new tests or modified existing tests to cover
new code paths.
(Please explain in the PR description how the new code paths are tested,
such as names of the new/existing tests that cover them.)
- [ ] Performance testing has been performed and its results are added
in the PR description. Or, an issue has been filed with a link in the PR
description.

- Replace StringUtils.escapeLikeRegex call with a focused O(n) escape
  char validation that only checks the two invalid cases (escape char
  at end of pattern, escape char followed by non-special character).
  This avoids building a full regex string during planning.
- Remove runtime safety net in GpuLike.doColumnar — tagExprForGpu
  already prevents GpuLike from being created for invalid patterns.
- Remove now-unused StringUtils import from stringFunctions.scala.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 (Collaborator Author)

What is the performance impact of this change? All PRs are supposed to have a performance impact unless it is something like a doc or comment only change.

Updated the PR description's performance-impact section per the latest review changes. The check now runs only once, at planning time, with O(n) logic, and LIKE pattern strings are usually not that long.

@wjxiz1992 (Collaborator Author)

@revans2 I'm trying to add rules for the LLM that we should run a performance regression check after touching any core functional code. My question is: what kind of performance check is appropriate for these changes?

I can think of 2 types in general:

  • NDS
  • micro benchmarks

For NDS, I've seen colleagues use it for things like shuffle optimization, join optimization, rule optimizations, etc.

For micro benchmarks, cuDF has a lot of them, and I recall some micro-benchmarks in spark-rapids designed for long-string operations.

Using this regex change as an example, I would prefer a micro benchmark over NDS if it really needs a performance regression check. What's your suggestion on this topic? Is there a general rule for choosing which type of performance check to apply?


revans2 commented Mar 12, 2026

@wjxiz1992 we need to use different benchmarks for different reasons and situations. If the change targets a specific set of functionality ( an expression for example, or columnar to row modification), then a targeted micro benchmark can isolate the impact of the change. If the change is general, or the micro benchmark showed a significant regression then a more real world benchmark is helpful to either put the regression in context of real world processing or to exercise the broad change. But we have to be sure that that "real world" benchmark exercises the code we care about. NDS does not fall back to the CPU for any processing. That is good to show shuffle, but not columnar to row changes. NDS does not use maps, or arrays, or structs. So it is not good to test them. It hardly uses strings like real world queries do, so it is bad at that too. We often come up with other more realistic queries or use inspiration from customer queries we know of to help build some of these types of tests.

In general I want to think about how this change could impact the performance of the system and then build tests that would cover that case. Here we have escape checks that happen at planning time and escape checks that are happening at run time. The run time checks happen on a per-batch basis. So I probably want a few different tests. Some that are more malicious and some more realistic.

The "real world" test would be something like spark.time(spark.range(0, 4294967296L, 1, 32).selectExpr("CAST(id as STRING) as sid").filter("SID LIKE '%B1_B1%' ESCAPE 'B'").count). I didn't check the syntax, but it should give about a 1 GiB batch to each task and 32 tasks to work with (assuming a 1 GiB target batch size and 8 or 16 CPU cores, with Spark running in local mode). This way we get parallelism on the GPU and multiple waves. I typically capture a history file with this too and compare the op time for the operator that includes the change (probably the filter in this case). For this case you have to have some knowledge of how Spark deals with the LIKE operator. If the pattern is static, it may optimize some patterns into other operators, so we cannot do simple things; for example, '%1%' is converted into a contains.

If I want to be malicious, then I start to change things around. We can set the target batch size much smaller (10 MiB for example so we get about 100x the number of batches) and see if the runtime check at the beginning of each batch has a measurable impact for before and after.

We could also run a lot of two- or three-row queries in a loop to see if the planning change has a measurable impact.

If we test what we think is the worst case situations and there is no measurable impact then we know it is not a big deal.

For this case I would not likely use NDS at all. There are few LIKE operations in it. There are few strings too, and the size of the string data isn't large, also most of the patterns that NDS uses are ones that Spark will rewrite to not use LIKE directly at all.
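
The "real world" benchmark suggested above could be sketched as the following spark-shell session. This is a hypothetical sketch under the stated assumptions (local mode, 1 GiB target batch size); note that the original pattern '%B1_B1%' has 'B' escaping '1', which the validation in this PR would reject, so this sketch uses a pattern with a valid escape ('B' before '_') to keep the work on the GPU.

```scala
// Hypothetical spark-shell sketch of the suggested benchmark:
// ~4.3 billion rows split across 32 tasks, roughly 1 GiB per batch.
val df = spark.range(0, 4294967296L, 1, 32)
  .selectExpr("CAST(id AS STRING) AS sid")

// 'B' escapes '_' here, which is valid, so planning does not fall back
// to the CPU and the filter exercises the GPU LIKE path.
spark.time(df.filter("sid LIKE '%B_1%' ESCAPE 'B'").count())
```

Capturing a history file alongside this run, as described above, lets the filter operator's time be compared before and after the change.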



Development

Successfully merging this pull request may close these issues.

[AutoSparkUT]"SPARK-33677: LikeSimplification should be skipped if pattern contains any escapeChar" in SQLQuerySuite failed
